Preliminaries

The question under investigation


Ukrainian, like Polish, Belarusian and Russian, has a particle that may optionally mark yes/no questions: чи (cf. Polish czy and Russian ли).

  – Чи ти , Стехо , здорова ? ‘Are you well, Stecha?’ [Олександр Кониський:Наймичка,1874]

The same question could equally well have been asked without the particle:
  Ти , Стехо , здорова?

The particle cannot be combined with wh-question words (‘how’, ‘why’ and so on):

  – А ви чого, дурні, радієте? ‘But why are you happy, fools?’ [Олександр Кониський:Наймичка,1874]
not:
  – Чи ви чого , дурні , радієте?

 Like its counterparts in the other languages under discussion, чи may also be used as a disjunctive connector (like whether … or in English).

 The reasons for using this particle where it is not obligatory are not fully clear; however, there seem to be regional, temporal and other biases in its usage. In this analysis, I explore these possible interactions.

Data used

 For this project, I used the GRAC corpus: [Maria Shvedova, Ruprecht von Waldenfels, Sergiy Yarygin, Mikhail Kruk, Andriy Rysin, Michał Woźniak (2017-2018): GRAC: General Regionally Annotated Corpus of Ukrainian. Electronic resource: Kyiv, Oslo, Jena. Available at uacorpus.org].

 From it, I extracted questions using the following CQL query:

<s> []{1,15} [word =="?"]

 Then I wrote a Python script to extract question particles, such as А, Чи and Невже, from the questions. A sample of its output:

AUTHOR | TITLE | YEAR | COUNTRY | GENRE | REGION | CITATION | QWORD
Іван Карпенко-Карий | З Івана-пан; а з пана-Іван | 1884 | UA | DRA | UA-KRV; | А що ж зо мною буде ? | QWORD
Іван Карпенко-Карий | З Івана-пан; а з пана-Іван | 1884 | UA | DRA | UA-KRV; | Вже ж як ви знову пан , то я Іван і Оксанка вам не нужна ? | QWORD
Іван Карпенко-Карий | З Івана-пан; а з пана-Іван | 1884 | UA | DRA | UA-KRV; | Где же Иван ? | NOQWORD
Іван Карпенко-Карий | Чортова скала | 1884 | UA | DRA | UA-KRV; | А ти чого лежиш ? | QWORD
Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Чи ви ж доглядали її , батеньку ? | CZY
Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Скажіть же бо , що й як ? | QWORD
Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | — А чого , дочко , мене лякаєш ? | QWORD
Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | « Що се з нею подіялося ? | QWORD
Марко Вовчок | Горпина | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Цілісінький день ходить мовчки та городній мак ізбирає ; а спитати , нащо ? | QWORD
Марко Вовчок | Данило Гурч | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | – А що таке ? | QWORD
Марко Вовчок | Інститутка | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | А ти , Прокопе , чому не йдеш ? | QWORD
Марко Вовчок | Інститутка | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Чи , може , ця краля ? | CZY
Марко Вовчок | Інститутка | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | Лиха наша пані молода ? | NOQWORD
Марко Вовчок | Інститутка | 1861 | UA | FIC | RU;UA-HRK;UA-CRG;UA-KYV;UA-VNC; | — Яковось-то жилося тобі , серденько , самій ? | NOQWORD

Comments on the data files used

 The data files used in this research are described below: the original data file produced by the Python script (an extract is shown above) and the summary tables used for the plots (see the interactive plots below). The fields are:
Author: The writer of a given text.

Title: The title of a given text.

Year: The publication year of a given text.

Country: The primary country of the writer’s activity. Not used in the analysis so far.

Genre: The genre of a text. Fiction texts make up the main share of our database, while poetry is excluded from it completely.

Region: The main regions of the author’s activity, as filled in by the creators of the GRAC corpus.

Citation: The question sentence extracted from the GRAC corpus (KWIC). The original output file contains many “false” duplicates, that is, identical strings for one title of a given author that were doubled due to an unidentified error in the GRAC engine. There were also some “true” duplicates, that is, identical questions that occur more than once in one text, such as Невже це так? In our calculations we used only unique question strings, omitting both “false” and “true” duplicates.

Qword: The type of the question string, as determined by the Python script.

 1. QWORD - Any type of wh-word (‘how’, ‘where’, ‘why’ and so on) is found in the question string. Even if another question particle (чи, невже and so on) is present, the question is assigned to QWORD and then eliminated from our analysis.

 2. NOQWORD - A question without any question particle (counted in our analysis).

 3. CZY - One instance of the particle чи is found in the question string.

 4. 2CZY - Two or more instances of чи are found in the question string. This usually indicates the use of чи as a disjunctive marker (‘or’), cf.:

Чи він тільки сам урятувався , чи ще хто відступив ? ‘Was he the only one who escaped, or did someone else get away?’ (Петро Панч, Облога ночі)

So far, this usage is completely omitted from our analysis.

 5. HIBA, NEVZHE - The rhetorical question particles Хіба and Невже. As of August ’19 they have not yet been studied within our analysis. They cannot be used together with чи, which is why we count them as “NOQWORD” instances.

 6. A - The emphatic particle А, which may occasionally co-occur with Невже or Хіба. In our analysis it is counted as “NOQWORD”.

ALLSUM: The sum of all non-wh-word questions, including the 2CZY questions (those with two or more чи), which are excluded from all other calculations. This parameter is used to estimate the size of the text processed for question extraction.
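
The labelling rules and the de-duplication step described above can be sketched roughly as follows. This is not the original script: the wh-word list, the punctuation set, the order in which the particles are checked, the treatment of А as a sentence-initial token, and the function names `classify_question` and `unique_questions` are all assumptions made for illustration.

```python
# Illustrative wh-word list: an assumption, NOT the list used by the original script.
WH_WORDS = {"що", "чого", "чому", "як", "де", "куди", "коли", "нащо", "навіщо", "скільки"}
# Punctuation stripped from the pre-tokenized corpus strings (assumed set).
PUNCT = "«»„“”\"'’—–-,.;:!?…"


def classify_question(citation):
    """Assign one of the QWORD-field labels to a question string."""
    tokens = [t.strip(PUNCT).lower() for t in citation.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    # A wh-word wins over any particle, as stated in item 1.
    if any(t in WH_WORDS for t in tokens):
        return "QWORD"
    czy_count = tokens.count("чи")
    if czy_count >= 2:
        return "2CZY"
    if czy_count == 1:
        return "CZY"
    # The order of the remaining checks is an assumption.
    if "невже" in tokens:
        return "NEVZHE"
    if "хіба" in tokens:
        return "HIBA"
    if tokens and tokens[0] == "а":
        return "A"
    return "NOQWORD"


def unique_questions(rows):
    """Keep the first occurrence of each (author, title, citation) triple."""
    seen = set()
    kept = []
    for author, title, citation in rows:
        key = (author, title, citation)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


print(classify_question("Чи ви ж доглядали її , батеньку ?"))
print(classify_question("Лиха наша пані молода ?"))
```

A token-based check like this is deliberately crude: it cannot tell an interrogative pronoun from an indefinite one, which is one reason the word list matters so much for the resulting labels.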

Find more information about the Genre shares in our database
Genre of each question token (not of each text) in our database. The genre labels were imported from the GRAC corpus.

## 
##    ACA    CHI    DIA    DRA    ETH    FIC    HIS    JOU    LET    MEM 
##    950   3306    626  24925     43 106049    850   2937     27   3079 
##    PRE 
##      3

ACA - Academic texts.

CHI - Children’s literature.

DIA - Diaries.

DRA - Drama.

ETH - Ethnographic works (the only author here is Агатангел Кримський).

FIC - Fiction.

HIS - Historical literature.

JOU - Journalism.

LET - Letters.

MEM - Memoirs.

PRE - Speeches.


 As shown above, the vast majority of question tokens belong to FIC (Fiction), so no particular emphasis is given to the GENRE feature, even though it is quite


The main metrics: czyratio and czyperc.

 There are two types of metrics used for this analysis.

 The first is the ratio of questions with чи (czyratio). It was used for plots with the writer’s birth year as the time axis. The formula is:

\[\mathit{czyratio} = \frac{CZY}{NOQWORD + HIBA + NEVZHE + A}\]

 CZY questions are not included in the sum in the denominator in order to amplify the results.

 The second metric is the more traditional percentage of чи-questions (czyperc). It was used for plots with particular book titles on the time axis. The formula is:

\[\mathit{czyperc} = \frac{CZY}{NOQWORD + HIBA + NEVZHE + A + CZY}\]

 Note that 2CZY questions, those with чи as a disjunctive connector, are excluded from the denominators of both metrics, because in this use чи is obligatory.
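
As a minimal sketch, the two metrics can be computed from a dictionary of label counts as shown below. The function name `czy_metrics` and the example counts are invented for illustration; they are not taken from the dataset.

```python
def czy_metrics(counts):
    """Return (czyratio, czyperc) for a dict of QWORD-label counts.

    2CZY and QWORD counts are ignored, as in both formulas above.
    None is returned for a metric whose denominator is zero.
    """
    czy = counts.get("CZY", 0)
    no_czy = sum(counts.get(k, 0) for k in ("NOQWORD", "HIBA", "NEVZHE", "A"))
    czyratio = czy / no_czy if no_czy else None
    total = no_czy + czy
    czyperc = czy / total if total else None
    return czyratio, czyperc


# Made-up counts for one hypothetical author.
example = {"CZY": 20, "NOQWORD": 60, "HIBA": 10, "NEVZHE": 5, "A": 5, "2CZY": 7, "QWORD": 300}
ratio, perc = czy_metrics(example)
print(ratio, perc)  # 20/80 and 20/100
```

Because CZY is left out of its denominator, czyratio is not bounded by 1: an author with more чи-questions than particle-less ones gets a value above 1, whereas czyperc always stays between 0 and 1.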



Overview & Hypothesis

 There is a marked difference between writers born in western Ukraine and those born in the central and southeastern regions. The divergence is most noticeable among authors born in the late 19th century, whose productive period fell in the 1920s-1940s. Their works show a much higher share of questions with czy than the texts of their peers from central and southeastern Ukraine. We suppose that this is due to Polish influence, as in Polish the cognate particle czy is very common in non-wh-word questions.

 However, the data for writers of the previous generation (around 1850) do not differ from the observations for their peers. That is quite peculiar, as the sociolinguistic situation in western Ukraine was different from the one in the eastern regions.

 Among writers from central Ukraine, the general downward trend shows up as a sharp decrease in czy-questions: from a common question particle in the 19th century to a more marginal one in the 20th. In my opinion, this pattern reflects a similar process in the Russian language.

 However, in works by authors from central or southeastern Ukraine this process appears to have slowed down or even ended around 1940, after which the share of questions with чи started to rise gradually. The individual lifespan data for selected authors (see below) show that such a trend existed in the second half of the 20th century.

 The scatterplot of czyperc in individual titles (books) against year shows a picture quite similar to the previous one, which is to be expected, as this metric is mainly a proxy for the by-author one.

Results


Datafile by writer

The code that was used to make the table above
giant_ukrdata_after45_notTransl_table <- as.data.frame.matrix(table (giant_ukrdata_after45[QWORD!="QWORD"&TRANSLATOR=="",QWORD,by=AUTHOR]), keep.rownames = FALSE)

Plots by writer

 Our data show that there is a regional and temporal bias in the use of the чи particle. The scatterplot below depicts the relationship between the author’s birth year and the proportion of questions with чи.

ggplot(writerstable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"&macroreg!="NA"], aes(x=as.double(BIRTHYEAR), y=as.double(czyratio), color=macroreg)) + geom_point() + ylim(0,1) +  stat_smooth(method = 'loess') 


You can explore the same plot as an interactive plot with extra data (writers born outside Ukraine or with an undetermined region of birth).


wgplot <- ggplot(writerstable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"], aes(x=as.double(BIRTHYEAR), y=as.double(czyratio), color=macroreg)) + geom_point(aes(text=sprintf("Author:%s<br>Birthyear:%s<br>Region of birth:%s<br>macroreg:%s", rn, BIRTHYEAR, Birthreg,macroreg))) + ylim(0,1) +  stat_smooth(method = 'loess')
## Warning: Ignoring unknown aesthetics: text
ggplotly(wgplot)
## Warning: Removed 1 rows containing non-finite values (stat_smooth).

 If we make two separate plots, each with a linear regression line, using 1900 as the cutoff year, the two separate trends become clearer.

Born before 1900

ggplot(writerstable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"&macroreg!="NA"&BIRTHYEAR<1900], aes(x=as.double(BIRTHYEAR), y=as.double(czyratio), color=macroreg)) + geom_point() + ylim(0,1) +  stat_smooth(method = 'lm') 


Born after 1900

ggplot(writerstable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"&macroreg!="NA"&BIRTHYEAR>1900], aes(x=as.double(BIRTHYEAR), y=as.double(czyratio), color=macroreg)) + geom_point() + ylim(0,1) +  stat_smooth(method = 'lm') 
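
What the two `method = 'lm'` plots estimate can be sketched without any plotting library: split the writers at the cutoff year and fit an ordinary least-squares line per period and macro-region. The helper names `ols_line` and `trends_by_period`, the tuple layout and the toy values below are assumptions for illustration only.

```python
def ols_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def trends_by_period(writers, cutoff=1900):
    """Fit one line per (period, macroreg) group.

    `writers` is an iterable of (birthyear, macroreg, czyratio) tuples.
    """
    groups = {}
    for birthyear, macroreg, czyratio in writers:
        if birthyear == cutoff:
            continue  # the two plots filter with < and >, so the cutoff year is in neither
        period = "before" if birthyear < cutoff else "after"
        groups.setdefault((period, macroreg), []).append((birthyear, czyratio))
    result = {}
    for key, points in groups.items():
        xs = [p[0] for p in points]
        if len(set(xs)) >= 2:  # a line needs at least two distinct birth years
            result[key] = ols_line(xs, [p[1] for p in points])
    return result


# Toy data, not taken from the corpus.
toy = [(1840, "center", 0.6), (1860, "center", 0.4), (1880, "center", 0.2),
       (1910, "center", 0.1), (1930, "center", 0.2)]
for key, (slope, intercept) in trends_by_period(toy).items():
    print(key, round(slope, 4))
```

A negative slope in the "before" group and a positive one in the "after" group would correspond to the fall-then-rise pattern described in the overview.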

Datafile by title


Data:

R code used to produce this table:

yeartable_all_years_merged_ALL <- as.data.frame.matrix(table (all_merged_ALL[QWORD!="QWORD"&TRANSLATOR==""&QWORD!=""&YEAR!="",QWORD,by=YEAR]), keep.rownames = FALSE)

#adding titleyear
all_merged_ALL[, titleyear := paste(TITLE, YEAR, sep = "-") ]

yeartable_all_years_merged_ALL_titles <- as.data.frame.matrix(table (all_merged_ALL[QWORD!="QWORD"&TRANSLATOR==""&QWORD!=""&YEAR!="",QWORD,by=list(titleyear)]), keep.rownames = FALSE)
yeartable_all_years_merged_ALL_titles <- as.data.table(yeartable_all_years_merged_ALL_titles,keep.rownames = TRUE)
#renaming columns - unifying dataset
names(yeartable_all_years_merged_ALL_titles)[names(yeartable_all_years_merged_ALL_titles) == "titileyear"] = "titleyear"

#by title
yeartable_all_years_merged_ALL_titles_years <- all_merged_ALL[,unique(YEAR),by=list(titleyear,AUTHOR,GENRE)]

yeartable_all_years_merged_ALL_both <- merge(yeartable_all_years_merged_ALL_titles,yeartable_all_years_merged_ALL_titles_years,by="titleyear")

#something new
# fwrtbl_part <- full_writerstable3_withyears[,c("rn","BIRTHYEAR","Birthreg","macroreg")]

names(fwrtbl_part)[names(fwrtbl_part) == "rn"] = "AUTHOR"

yeartable_all_years_merged_ALL_both2 <- merge(yeartable_all_years_merged_ALL_both,fwrtbl_part,by="AUTHOR")

titlestable_all <- yeartable_all_years_merged_ALL_both2
titlestable_all$N <- NULL

titlestable_all_u <- titlestable_all[!duplicated(titlestable_all[,c('AUTHOR','titleyear')]),]

#SAVE such a great dataset
write.csv(titlestable_all_u, file = "titlestable_all_u2.csv", row.names = F)
titlestable_all_u <- fread ("titlestable_all_u2.csv", sep = ',')

#adding genre
genre_table_by_title <- all_merged_ALL[,unique(titleyear),by=list(GENRE)]

names(genre_table_by_title)[names(genre_table_by_title) == "PUBYEAR"] = "titleyear"

titlestable_all_u_bygenre <- merge(titlestable_all_u,genre_table_by_title, by="titleyear")
names(titlestable_all_u_bygenre)[names(titlestable_all_u_bygenre) == "GENRE.x"] = "GENRE"

#adding czyperc and correct sum - input table may be titlestable_all_u or titlestable_all_u_bygenre

titlevec_correct <- as.vector(titlestable_all_u$titleyear)

for (tttitle in titlevec_correct){
  ttl <- titlestable_all_u[titleyear==tttitle]
  no_czy_sum <- NULL
  no_czy_sum <- sum(ttl[1,4:9])
  czyperc_ttl <- ttl[1,5]/no_czy_sum
  titlestable_all_u[titleyear==tttitle, czyperc:=czyperc_ttl]
  titlestable_all_u[titleyear==tttitle, ALLSUM:=sum(titlestable_all_u[titleyear==tttitle,3:9])]
}
#remove anthologies with several authors
titlestable_all_u_bygenre <- titlestable_all_u_bygenre[titleyear!="Свідчення очевидців Голоду. Том ІІІ-1985"]

write.csv(titlestable_all_u, "titlestable_all_u2",row.names = F)

Plots by title

Interactive scatterplot of texts that contain, in total, more than 30 questions without wh-words and were written by authors born in Ukraine.

titlegplot <- ggplot(titletable[BIRTHYEAR!="NA"&BIRTHYEAR!="TRANSL"&macroreg!="NA"&ALLSUM>30], aes(x=as.double(PUBYEAR), y=as.double(czyperc), color=macroreg)) + geom_point(aes(text=sprintf("Author:%s<br>Birthyear:%s<br>macroreg:%s<br>Title: %s<br>Genre: %s", AUTHOR, BIRTHYEAR, macroreg, titleyear, GENRE))) + ylim(0,0.6) +  stat_smooth(method = 'loess')
## Warning: Ignoring unknown aesthetics: text
ggplotly(titlegplot)
## Warning: Removed 3 rows containing non-finite values (stat_smooth).


Another interactive plot by title, without a regression line and with a relaxed filter (ALLSUM>10, that is, all texts with more than 10 non-wh-word questions); it also includes authors with NA in “Birthreg”, mainly from outside Ukraine.

You can zoom this plot by using “Zoom” or “Box select” tool.


Individual scatterplots by author.

We may also look not at the general trend for a region but at how individual authors’ use of the чи particle changed over their lifetimes.

For that, we chose texts with more than 30 non-wh-word questions (ALLSUM>30) to minimize the error in calculating czyperc. Then we made scatterplots for the authors who have more than 5 such texts, so as to be able to say something about change over their lifespan. There were 22 such authors in total. The data show that, in general, the share of questions with чи remains more or less constant over a writer’s lifespan.

However, there are some notable exceptions. The first is Ivan Nechuy-Levitsky, whose works consistently show a downward trend in the use of the чи particle. A similar falling trendline can be seen in the works of Ivan Karpenko-Karyi. The two are quite similar in that both were born before 1850 (in 1838 and 1845, respectively) in rural communities. We may speculate that this trend reflects a general tendency towards closeness to the spoken language.

Another clear example of a lifespan change is Oles Honchar. His numerous works show a clear upward trend, reinforced by the striking example of the volumes of his diaries, dated from 1965 to 1993, which distinctly follow the upward trend in the use of чи.
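
The selection of texts and authors for these individual plots can be sketched as a two-step filter. The function name `titles_for_lifespan_plots` and the row layout (dictionaries with `AUTHOR`, `titleyear` and `ALLSUM` keys, mirroring the column names used in this report) are assumptions; the thresholds are the ones stated above.

```python
from collections import Counter


def titles_for_lifespan_plots(titles, min_questions=30, min_texts=5):
    """Keep texts with ALLSUM > min_questions whose author has more than min_texts such texts."""
    # Step 1: drop texts that are too small for a stable czyperc estimate.
    large = [t for t in titles if t["ALLSUM"] > min_questions]
    # Step 2: keep only authors with enough large texts to show a lifespan trend.
    per_author = Counter(t["AUTHOR"] for t in large)
    selected = {author for author, n in per_author.items() if n > min_texts}
    return [t for t in large if t["AUTHOR"] in selected]


# Toy rows, not taken from the corpus.
toy = [{"AUTHOR": "X", "titleyear": f"T{i}-19{i}0", "ALLSUM": 40} for i in range(6)]
toy.append({"AUTHOR": "X", "titleyear": "small-1900", "ALLSUM": 10})
toy += [{"AUTHOR": "Y", "titleyear": f"U{i}-19{i}0", "ALLSUM": 50} for i in range(5)]
kept = titles_for_lifespan_plots(toy)
print(len(kept), sorted({t["AUTHOR"] for t in kept}))
```

Both comparisons are strict, so an author with exactly 5 qualifying texts is left out, matching the wording "more than 5".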


You may see all 22 individual writer scatterplots below. To see the key for a point (author’s name, Birthreg and so on), hover the mouse over it in the scatterplot.


##  [1] "Богдан Лепкий"         "Валер'ян Підмогильний"
##  [3] "Володимир Винниченко"  "Іван Карпенко-Карий"  
##  [5] "Іван Микитенко"        "Іван Нечуй-Левицький" 
##  [7] "Микола Трублаїні"      "Микола Хвильовий"     
##  [9] "Михайло Коцюбинський"  "Олександр Довженко"   
## [11] "Олександр Тесленко"    "Олесь Бердник"        
## [13] "Олесь Гончар"          "Олесь Досвітній"      
## [15] "Ольга Кобилянська"     "Павло Загребельний"   
## [17] "Петpo Панч"            "Роман Федорів"        
## [19] "Софія Парфанович"      "Степан Васильченко"   
## [21] "Улас Самчук"           "Юрій Смолич"
## Warning: Ignoring unknown aesthetics: text

## Warning: Removed 4 rows containing non-finite values (stat_smooth).

Plots by genre

 The table below shows the genre distribution in our dataset. As noted in the “Preliminaries” section, the predominance of the Fiction genre and the consequent lack of observations for the other genres (here, writers active in a given genre) do not allow us to draw a definitive conclusion. However, some genre bias is quite clear: Academic works (including Historical) and Journalistic texts are more prone to use чи.

for (ggenre in c("ACA","CHI","DRA","FIC","HIS","JOU","MEM")){
  glen <- length(table(all_merged_ALL[GENRE==ggenre,AUTHOR]))
  print(paste("Number of authors in ", ggenre, " genre: ",glen,sep = ""))
}
## [1] "Number of authors in ACA genre: 12"
## [1] "Number of authors in CHI genre: 10"
## [1] "Number of authors in DRA genre: 21"
## [1] "Number of authors in FIC genre: 135"
## [1] "Number of authors in HIS genre: 13"
## [1] "Number of authors in JOU genre: 30"
## [1] "Number of authors in MEM genre: 19"
ggplot(titletable[GENRE!="LET"&GENRE!="NA"&GENRE!="PRE"&GENRE!="DIA"&GENRE!="ETH"], aes(x=as.factor(GENRE), y=czyperc)) + geom_boxplot(outlier.shape = 19)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).